Demo script for PathAnalyser R package which provides functionality for assessing ER and HER2 pathway activity in breast cancer transcriptomic datasets.

Installation and setup

Installing dependencies

# If not already installed
install.packages("BiocManager")

BiocManager::install(c("GSVA", "pROC", "edgeR", "reshape2", "ggplot2","limma", 
                       "VennDiagram", "NCmisc", "futile.logger"), 
                     dependencies = TRUE)

Installing package using devtools

install.packages("devtools")
devtools::install_github("a-thind/PathAnalyser")

Alternatively, you can clone the GitHub repository in a Linux / MacOS terminal:

git clone git@github.com:a-thind/PathAnalyser.git

Then type the following in R to install a local source version of the package:

library(utils)
install.packages("~/PathAnalyser", repo=NULL, type="source")

Once installation is completed, load the library to start using PathAnalyser:

library(PathAnalyser)

The read_expression_data function reads gene expression data from
a file with an delimiter, the function checks the delimiter of the file and reads the file accordingly. It returns an output with gene symbols/IDs as the row names and columns representing each sample IDs in the dataset.

ER_expression_set <- read_expression_data("../raw_data/toy_data.txt")
# ER_expression_set<-ER_expression_set[,1:200]
# print(ER_expression_set[1:5, 1:5])

ER_expression_set contains RNASeq raw read counts for 20 primary breast cancer samples, 10 of which have ER pathway activity (ER+) and 10 which have inactive ER pathway activity (ER-).

dim(ER_expression_set)
#> [1] 20124    20
# Expression data for first 6 genes
head(ER_expression_set)
#>         TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           16841         6974         8211         5586        12210
#> TMEM107          189          851          581          433         1298
#> CAMK2A            10           20           18           27           19
#> TRAM1L1          233          198          426          328         1044
#>         TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            8627         9138         3219         7663         6867
#> TMEM107          629          253          226         1265          385
#> CAMK2A            30           47           72           45           18
#> TRAM1L1          385          467          153          342           33
#>         TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4416         2869          718        14649         7106
#> TMEM107          332          660          230         1738          319
#> CAMK2A            29           34           25            4           48
#> TRAM1L1          212          482           28          428          507
#>         TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            6351         3150         5278         4839         9226
#> TMEM107          633          234          245          284          565
#> CAMK2A            32           34           23           43            9
#> TRAM1L1          169           71          220          272          615

Gene signature files can be read by the read_signature_data function. It returns a dataframe with signature IDs/symbols in the first column and their expression pattern is represented as up r regulated or down regulated by +1 and -1 respectively.

er_sign_df <- read_signature_data("../raw_data/ESR1_UP.v1._UP.grp", "../raw_data/ESR1_DN.v1_DN.grp")

dim(er_sign_df)
#> [1] 160   2
head(er_sign_df)
#>    genes expression
#> 1   ABAT          1
#> 2 ACADSB          1
#> 3  ADCY1          1
#> 4  ADCY9          1
#> 5   AGR2          1
#> 6  ANXA9          1
# up-regulated genes are given an expression value of 1
er_up_sign <- er_sign_df[er_sign_df$expression == 1 ,]
dim(er_up_sign)
#> [1] 101   2
# Down-regulated genes are given an expression value of -1
er_dn_sign <- er_sign_df[er_sign_df$expression == -1 ,]
dim(er_dn_sign)
#> [1] 59  2

Quality control and pre-processing of data

Log Counts Per Million (CPM) transformation of RNASeq expression matrices

The log_cpm_transformation function transforms the raw data by the method of log CPM and returns the transformed data. It also performs sanity check of the transformation by returning box plots for visualization of distribution of data before and after transformation.

# using toy data set as expression matrix
data_se <- HER2_data_se1 
normalized_se <- log_cpm_transformation(ER_expression_set)

Checking and filtering genes from the gene expression matrix

The gene names are first checked for inclusion in the gene signature data frame. Genes that are included in the gene signature are retained then subjected to further filtration, in which their expression across the data set measured. If a gene is not present in at least 10% of the total number of samples in the data set, the gene is dropped from the expression matrix. A console message is printed reporting the number of features (genes) retained in the final normalized expression matrix.

normalized_se <- check_signature_vs_dataset(normalized_se, er_sign_df)
#> Number of feature present in expression dataset: 147

Visualising GSVA Score Distribution

gsva_scores_dist method visualizes the GSVA score distribution for the abundance of expression of the up-regulated and down-regulated gene sets of the gene signature for each sample.

gsva_scores_dist(er_sign_df, normalized_se)


# Add thresholds on plot
library(ggplot2)
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.2, 0.2))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
#> NULL

Shrinking the distance between the high and low expression thresholds would result in fewer “Uncertain” pathway activity classifications

# more relaxed thresholds, fewer uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), 
                           vline=c(-0.1, 0.1))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
#> NULL

Conversely, expanding the distance between the thresholds would increase the frequency of “Uncertain” pathway activity classifications, as the number of samples meeting the thresholds for consistency and inconsistency would be reduced.

# more stringent thresholds, greater uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.4, 0.4))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
#> NULL

Classification using absolute GSVA score thresholds

classes_df <- classify_GSVA_abs(er_sign_df, normalized_se, 
                                    up_thresh.low=-0.2, up_thresh.high=0.2, 
                                    dn_thresh.low=-0.2, dn_thresh.high=0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         9        10         1 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 19

Classification using percentile GSVA score thresholds

# default percentile = 25% (quartile)
classes_df.percent <- classify_GSVA_percent(er_sign_df, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         4         5        11 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 9
# custom percentile = 30%
classes_df.percent <- classify_GSVA_percent(er_sign_df, normalized_se, 
                                            percent_thresh=30)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         6         6         8 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 12

Evaluating pathway activity classification

confusion_matrix <- calculate_accuracy("../raw_data/Sample_labels.txt", 
                                       classes_df.percent, "ER", display_statistics=TRUE, 
                                       display_roc_curve=TRUE)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                6               0                0
#> Prediction Negative                0               6                0
#> Prediction Uncertain               4               4                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 60.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 6"
#> [1] "True Negative(TN): 6"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   :0.000   Min.   :0.000   Min.   :0       
#>  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0       
#>  Median :4.000   Median :4.000   Median :0       
#>  Mean   :3.333   Mean   :3.333   Mean   :0       
#>  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:0       
#>  Max.   :6.000   Max.   :6.000   Max.   :0

Visualising pathway activity

Generating PCA plot

classes_pca(normalized_se, classes_df.percent, pathway_name = "ER")

Session Info

The output of sessionInfo upon which this file was generated:

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] plotly_4.10.0           pROC_1.18.0             GSVA_1.40.1            
#>  [4] reshape2_1.4.4          ggplot2_3.3.5           edgeR_3.34.1           
#>  [7] limma_3.48.3            reader_1.0.6            NCmisc_1.1.6           
#> [10] PathAnalyser_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-7                matrixStats_0.61.0         
#>   [3] bit64_4.0.5                 RColorBrewer_1.1-3         
#>   [5] httr_1.4.2                  GenomeInfoDb_1.28.4        
#>   [7] tools_4.1.1                 bslib_0.3.1                
#>   [9] irlba_2.3.5                 utf8_1.2.2                 
#>  [11] R6_2.5.1                    HDF5Array_1.20.0           
#>  [13] lazyeval_0.2.2              DBI_1.1.2                  
#>  [15] BiocGenerics_0.38.0         colorspace_2.0-3           
#>  [17] rhdf5filters_1.4.0          withr_2.5.0                
#>  [19] tidyselect_1.1.2            bit_4.0.4                  
#>  [21] compiler_4.1.1              graph_1.70.0               
#>  [23] cli_3.2.0                   Biobase_2.52.0             
#>  [25] DelayedArray_0.18.0         labeling_0.4.2             
#>  [27] sass_0.4.1                  scales_1.1.1               
#>  [29] stringr_1.4.0               digest_0.6.29              
#>  [31] rmarkdown_2.13              XVector_0.32.0             
#>  [33] pkgconfig_2.0.3             htmltools_0.5.2            
#>  [35] sparseMatrixStats_1.4.2     MatrixGenerics_1.4.3       
#>  [37] fastmap_1.1.0               highr_0.9                  
#>  [39] htmlwidgets_1.5.4           rlang_1.0.2                
#>  [41] rstudioapi_0.13             RSQLite_2.2.12             
#>  [43] DelayedMatrixStats_1.14.3   jquerylib_0.1.4            
#>  [45] farver_2.1.0                generics_0.1.2             
#>  [47] jsonlite_1.8.0              crosstalk_1.2.0            
#>  [49] BiocParallel_1.26.2         dplyr_1.0.8                
#>  [51] BiocSingular_1.8.1          RCurl_1.98-1.6             
#>  [53] magrittr_2.0.3              GenomeInfoDbData_1.2.6     
#>  [55] Matrix_1.4-1                Rhdf5lib_1.14.2            
#>  [57] Rcpp_1.0.8.3                munsell_0.5.0              
#>  [59] S4Vectors_0.30.2            fansi_1.0.3                
#>  [61] lifecycle_1.0.1             stringi_1.7.6              
#>  [63] yaml_2.3.5                  SummarizedExperiment_1.22.0
#>  [65] zlibbioc_1.38.0             rhdf5_2.36.0               
#>  [67] plyr_1.8.7                  grid_4.1.1                 
#>  [69] blob_1.2.2                  parallel_4.1.1             
#>  [71] crayon_1.5.1                lattice_0.20-45            
#>  [73] beachmat_2.8.1              Biostrings_2.60.2          
#>  [75] annotate_1.70.0             KEGGREST_1.32.0            
#>  [77] locfit_1.5-9.5              knitr_1.38                 
#>  [79] pillar_1.7.0                GenomicRanges_1.44.0       
#>  [81] ScaledMatrix_1.0.0          stats4_4.1.1               
#>  [83] XML_3.99-0.9                glue_1.6.2                 
#>  [85] evaluate_0.15               data.table_1.14.2          
#>  [87] png_0.1-7                   vctrs_0.4.0                
#>  [89] tidyr_1.2.0                 gtable_0.3.0               
#>  [91] purrr_0.3.4                 assertthat_0.2.1           
#>  [93] cachem_1.0.6                xfun_0.30                  
#>  [95] rsvd_1.0.5                  xtable_1.8-4               
#>  [97] viridisLite_0.4.0           SingleCellExperiment_1.14.1
#>  [99] tibble_3.1.6                proftools_0.99-3           
#> [101] AnnotationDbi_1.54.1        memoise_2.0.1              
#> [103] IRanges_2.26.0              ellipsis_0.3.2             
#> [105] GSEABase_1.54.0